Boğaziçi University

IE582 Homework I

Selahattin Seha CIRIT - 2023705006

Brief Summary about the Homework

In this homework, the Borsa Istanbul stock price dataset was analyzed. The dataset contains randomly selected stocks with close prices over a given period. Following the required steps of the homework, a detailed data analysis was carried out, ranging from descriptive statistics to dimensionality reduction via the PCA method.

In [ ]:
# Required libraries imported

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as PCA_Function
from scipy.stats import kurtosis, skew, trim_mean
In [ ]:
# Read the data. To run the script in your local environment, please adjust DATA_LOCAL_PATH accordingly.
# Throughout the homework, only the long-format data is used, to allow proper data analysis.

DATA_LOCAL_PATH = "/home/ssc/Desktop/BOUN/Courses/Fall-23/IE 582/fall-23-ssehacirit/HW1/"

data = pd.read_csv(DATA_LOCAL_PATH + "all_ticks_long.csv.gz")

# Convert the given timestamp format into pandas datetime, which is easier to work with (timezone info dropped)
data["timestamp"] = pd.to_datetime(data.timestamp).dt.tz_convert(None)
data
Out[ ]:
short_name timestamp price
0 AEFES 2012-09-17 06:45:00 22.3978
1 AEFES 2012-09-17 07:00:00 22.3978
2 AEFES 2012-09-17 07:15:00 22.3978
3 AEFES 2012-09-17 07:30:00 22.3978
4 AEFES 2012-09-17 07:45:00 22.5649
... ... ... ...
2848025 ZOREN 2019-07-22 14:00:00 1.1800
2848026 ZOREN 2019-07-22 14:15:00 1.1800
2848027 ZOREN 2019-07-22 14:30:00 1.1800
2848028 ZOREN 2019-07-22 14:45:00 1.1800
2848029 ZOREN 2019-07-22 15:00:00 1.1700

2848030 rows × 3 columns

Task 1 - Descriptive Analysis

The dataset was summarized with statistics such as min, max, median, quartiles, and standard deviation. The main idea in this step is to get a feel for the structure of the dataset, understand the distributions of the given variables, and compute measures of central tendency and dispersion.

In [ ]:
# Because the dataset is in long format, the calculations are easily made with pandas' groupby.
# The required summary information is provided below.

data.groupby("short_name").apply(lambda x: pd.Series({
    "N": x["price"].count(),
    "minDate": x["timestamp"].min(),
    "maxDate": x["timestamp"].max(),
    "min_price": x["price"].min(),
    "q25_price": x["price"].quantile(0.25),
    "q50_price": x["price"].quantile(0.5),  # also known as the median
    "q75_price": x["price"].quantile(0.75),
    "max_price": x["price"].max(),
    "mean_price": x["price"].mean(),
    "std_price": x["price"].std(),
    "skew_price": skew(x["price"], nan_policy="omit"),
    "kurtosis_price": kurtosis(x["price"], nan_policy="omit"),
    "trim-mean_price": trim_mean(x["price"], proportiontocut=0.05),
    "mad_price": abs(x["price"] - x["price"].mean()).mean(),
    "range_price": x["price"].max() - x["price"].min()
}))
Out[ ]:
N minDate maxDate min_price q25_price q50_price q75_price max_price mean_price std_price skew_price kurtosis_price trim-mean_price mad_price range_price
short_name
AEFES 48131 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 19.16050 20.6465 22.73200 28.5090 20.982235 2.494002 0.306407 -0.445736 20.940443 2.041743 28.5089
AKBNK 49209 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 5.85000 6.3057 6.93250 9.2124 6.473105 0.944955 0.579306 -0.039912 6.446953 0.732379 9.2123
AKSA 48594 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 5.20880 6.9853 8.72000 15.1189 7.127504 2.710033 0.169080 -0.101457 7.100438 2.123153 15.1188
AKSEN 48171 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0000 2.67000 2.9300 3.75000 5.1900 3.183542 0.724332 0.825970 -0.353334 3.144727 0.603710 5.1900
ALARK 48335 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.56890 1.9376 2.42140 3.5143 2.060859 0.575943 0.645715 -0.533558 2.035536 0.473908 3.5142
ALBRK 46862 2012-09-17 06:45:00 2019-07-23 15:00:00 1.0255 1.22510 1.3602 1.50000 2.1900 1.365549 0.167824 0.391135 0.186814 1.361010 0.140982 1.1645
ANACM 48165 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.04700 1.2597 2.40210 3.5021 1.672102 0.788365 0.712827 -1.049554 1.633785 0.702866 3.5020
ARCLK 49045 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 11.71110 15.0100 19.08770 26.4278 15.388088 4.531459 0.115708 -0.767086 15.350387 3.893553 26.4277
ASELS 48803 2012-09-17 06:45:00 2019-07-23 07:00:00 0.0001 4.94030 9.2757 22.75670 46.7616 13.432535 9.624246 0.697246 -0.917501 12.894643 8.609648 46.7615
ASUZU 48433 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 5.07480 5.9496 7.12000 15.2800 6.467033 2.201036 1.482039 2.347907 6.255543 1.596651 15.2799
AYGAZ 48119 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 5.95150 7.7238 10.26900 13.5935 8.101948 2.610402 0.295531 -1.063786 8.081058 2.252556 13.5934
BAGFS 48650 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 8.26175 10.6100 12.35000 38.4352 10.407127 3.618058 2.966101 21.320628 10.262403 2.413341 38.4351
BANVT 47951 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0000 2.59000 3.7100 11.93000 28.6800 7.628230 6.267278 1.012752 -0.004878 7.079116 5.502986 28.6800
BRISA 48937 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 5.89000 6.7300 7.33000 10.3275 6.544896 1.295321 -0.169787 -0.172206 6.542497 1.025382 10.3274
CCOLA 48749 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 31.97820 34.8215 42.04970 54.2208 36.890707 6.747213 0.609038 -0.673140 36.662293 5.664458 54.2207
CEMAS 46394 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0000 0.70000 0.8700 1.50000 7.0100 1.209088 0.799981 2.354746 7.839219 1.107462 0.579235 7.0100
ECILC 48492 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.17230 1.8214 2.78090 4.2278 2.075865 0.973788 0.496998 -1.063596 2.041292 0.856328 4.2277
EREGL 49173 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 2.18120 3.0360 6.75870 10.4710 4.179544 2.690731 0.572196 -1.034466 4.073754 2.360051 10.4709
FROTO 48995 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 21.49380 27.1182 48.51160 65.4192 32.763693 14.732664 0.571357 -1.040751 32.348909 12.778048 65.4191
GARAN 49308 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 7.01540 7.6542 8.67860 12.1554 7.899734 1.249637 0.663114 0.088658 7.852954 0.994845 12.1553
GOODY 48961 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 2.42770 3.1920 3.59660 58.7574 3.102485 0.886456 10.197231 623.321452 3.074397 0.653305 58.7573
GUBRF 49057 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 3.27650 4.2500 5.13000 13.6191 4.328323 1.222988 0.501696 -0.438447 4.275643 1.026014 13.6190
HALKB 49071 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 8.72050 10.6531 13.49090 20.2365 10.919353 3.071563 0.205094 -0.601357 10.865138 2.574486 20.2364
ICBCT 44336 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0000 1.55960 2.0300 4.07000 11.2700 2.828502 1.789883 1.494058 2.611958 2.665473 1.444402 11.2700
ISCTR 49221 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 4.32000 4.8543 5.82030 7.9639 5.126551 1.003386 0.700135 -0.574340 5.079921 0.839786 7.9638
ISDMR 12227 2016-03-28 06:30:00 2019-07-23 15:00:00 1.0181 4.85420 5.9063 6.56070 7.5936 5.351663 1.697918 -1.227264 0.325652 5.473200 1.318965 6.5755
ISFIN 42877 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 0.56390 0.8635 1.67420 9.8300 1.559420 1.764839 2.167203 4.169988 1.310759 1.253275 9.8299
ISYAT 43184 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 0.44120 0.4957 0.63330 1.1500 0.537338 0.160246 0.529581 1.063215 0.533277 0.123556 1.1499
KAREL 46032 2012-09-17 06:45:00 2019-07-22 15:00:00 0.0001 1.53130 1.8200 5.25000 9.4600 3.178023 2.133619 0.859170 -0.689921 3.019454 1.916274 9.4599
KARSN 48527 2012-09-17 06:45:00 2019-07-22 15:00:00 0.0001 1.11000 1.2874 1.47000 2.5000 1.326907 0.290413 1.031105 1.282541 1.306489 0.224876 2.4999
KCHOL 49093 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 9.73680 12.0449 15.16930 19.1500 12.248291 3.181444 0.039270 -1.090014 12.245111 2.715647 19.1499
KRDMB 47532 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.56120 2.2007 2.72730 4.4960 2.222798 0.686385 0.494980 -0.497376 2.190607 0.577845 4.4959
KRDMD 49161 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.08450 1.3979 2.16900 4.9510 1.768390 0.940092 1.368964 1.130739 1.683724 0.732107 4.9509
MGROS 48903 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 16.66000 19.1100 22.10000 30.2600 19.576424 3.901269 0.466080 -0.338842 19.449764 3.171047 30.2599
OTKAR 48785 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 56.77570 82.8224 105.49880 139.4288 81.419528 27.782825 -0.233590 -0.890270 81.979483 23.389661 139.4287
PARSN 45325 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0000 4.57000 7.8900 10.65000 29.8200 8.276989 4.662471 1.119821 1.930259 7.923042 3.613881 29.8200
PETKM 49184 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.28690 2.2845 3.88280 5.7697 2.539237 1.378510 0.472626 -1.122838 2.477695 1.221582 5.7696
PGSUS 45221 2013-04-26 06:30:00 2019-07-23 15:00:00 0.0000 17.79000 25.6400 29.44000 50.6500 24.789487 7.656535 0.215107 -0.321921 24.576927 6.233082 50.6500
PRKME 48466 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 2.38950 2.7400 3.43650 5.4300 2.927109 0.721949 0.659123 -0.643581 2.898181 0.597200 5.4299
SAHOL 49095 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 7.96520 8.6079 9.26820 11.6826 8.615896 0.955310 0.056288 -0.279519 8.608193 0.764254 11.6825
SASA 47633 2012-09-17 06:45:00 2019-07-22 15:00:00 0.0001 0.31920 0.7335 4.94730 8.4260 2.294876 2.492934 0.828215 -1.016584 2.142603 2.244777 8.4259
SISE 49090 2012-09-17 06:45:00 2019-07-22 15:00:00 0.0001 1.92200 2.6682 4.14600 6.9230 3.048367 1.422848 0.511570 -0.777349 2.995630 1.223717 6.9229
SKBNK 47270 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.20000 1.5100 1.72070 2.2516 1.473651 0.294908 -0.068399 -1.092395 1.475175 0.260682 2.2515
SODA 48276 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 1.47580 2.6684 4.28610 7.7659 3.189591 2.045764 0.535910 -0.656420 3.123715 1.705455 7.7658
TCELL 49143 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 8.56630 9.7001 11.23640 15.8125 9.828003 2.356250 -0.163725 0.233760 9.865075 1.774657 15.8124
THYAO 49282 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 6.43000 7.7800 12.27000 19.9500 9.288821 4.027293 0.932459 -0.333144 9.069076 3.279337 19.9499
TKFEN 48930 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 4.31900 5.7532 14.24675 27.3200 9.191809 6.667047 1.168818 0.008360 8.627737 5.586694 27.3199
TOASO 48946 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 10.36560 16.5554 20.65130 29.9218 16.597275 6.328241 0.110278 -1.010564 16.544861 5.366081 29.9217
TRKCM 48886 2012-09-17 06:45:00 2019-07-22 15:00:00 0.0001 1.17420 1.6270 2.98260 4.6432 2.027849 1.099667 0.553203 -0.990363 1.996023 0.961232 4.6431
TSKB 48384 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 0.82540 0.9373 1.02440 1.4208 0.945233 0.155276 0.698243 0.122133 0.937418 0.122650 1.4207
TTKOM 49077 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 5.26730 5.7464 6.26000 7.3500 5.660680 0.818598 -0.791310 0.453484 5.701675 0.631715 7.3499
TUKAS 45929 2012-09-17 06:45:00 2019-07-23 15:00:00 0.6500 1.06000 1.5300 2.13000 5.9200 1.737529 0.867095 1.693861 3.236394 1.644571 0.631211 5.2700
TUPRS 49143 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 34.54910 49.5542 93.42870 139.2937 62.994535 32.398117 0.680820 -0.967032 61.312098 28.679349 139.2936
USAK 47659 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 0.95710 1.0500 1.37080 2.7578 1.220452 0.459532 1.295345 0.827238 1.184679 0.354688 2.7577
VAKBN 49212 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 4.03220 4.4742 5.24600 7.5814 4.735438 0.977889 0.845849 -0.073249 4.685568 0.780719 7.5813
VESTL 48781 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0000 4.02000 6.3200 7.45000 14.5400 5.942711 2.830465 0.181898 -0.368116 5.838000 2.239724 14.5400
YATAS 46055 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 0.38860 0.9658 4.23000 10.6748 2.434249 2.552377 1.068910 0.086681 2.210920 2.197288 10.6747
YKBNK 49225 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 2.26820 2.6093 2.87400 3.9581 2.566327 0.422774 -0.240540 0.021049 2.574394 0.341429 3.9580
YUNSA 45528 2012-09-17 06:45:00 2019-07-23 15:00:00 0.0001 3.00670 4.1078 4.72060 9.5275 4.079695 1.347020 0.870869 0.764225 3.990622 1.051336 9.5274
ZOREN 48807 2012-09-17 06:45:00 2019-07-22 15:00:00 0.0001 1.03380 1.2500 1.42650 2.4430 1.248124 0.311330 0.408987 0.273040 1.235133 0.239848 2.4429

Almost all of the summary statistics provide insight into the distributions of the given stocks. Mean price, standard deviation, skewness, kurtosis, and quartiles help us understand how each stock's price is distributed, and potential outliers can also be spotted from these values. However, rather than a value-based analysis, I prefer a plot-based analysis, which gives an overall perspective on the stocks. Histograms and box plots can be used to get these insights.

In [ ]:
# Stock-based histograms and their kernel density estimates

g = sns.displot(
    data=data, x="price", col="short_name", 
    col_wrap=5, kde=True, common_bins=False,
    facet_kws={'sharex': False, 'sharey': False},
    facecolor="lightgreen", color="black"
    )
In [ ]:
sns.catplot(
    data=data, y="price",
    col="short_name", kind='box', col_wrap=5, sharey=False
)
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7fa185344370>

From the histogram view, while some of the stocks look left-skewed (or right-skewed), it is hard to say anything about the others. In general, however, multimodal distributions are common among the stock prices.

Moreover, from the box plot view, the median, quartiles (25th and 75th percentiles), interquartile range, and outliers can be read off. For example, when I look at 'ZOREN', the lower whisker sits at (25th percentile - 1.5 * IQR), so values smaller than this bound can be labeled as outliers.

In this direction, while some stocks have outliers, other stocks do not include any outlier values.
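The whisker rule mentioned above can be made concrete with a few lines of pandas. This is a minimal sketch on a synthetic price series (not the real ZOREN data), flagging points outside the Q1 - 1.5 * IQR and Q3 + 1.5 * IQR bounds:

```python
import pandas as pd

# Synthetic, illustrative price series (not actual ZOREN prices)
prices = pd.Series([1.0, 1.1, 1.2, 1.25, 1.3, 1.35, 1.4, 5.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # box plot whisker bounds
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())  # -> [5.0]
```

Seaborn's `boxplot` uses the same 1.5 * IQR convention by default (its `whis` parameter), so points flagged this way match the dots drawn beyond the whiskers above.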

In [ ]:
# Long-to-wide format transformation

data_pivot = data.pivot_table(
    index=["timestamp"], columns="short_name", values="price"
    ).reset_index().rename_axis(None, axis=1)
data_pivot
Out[ ]:
timestamp AEFES AKBNK AKSA AKSEN ALARK ALBRK ANACM ARCLK ASELS ... TTKOM TUKAS TUPRS USAK VAKBN VESTL YATAS YKBNK YUNSA ZOREN
0 2012-09-17 06:45:00 22.3978 5.2084 1.7102 3.87 1.4683 1.1356 1.0634 6.9909 2.9948 ... 4.2639 0.96 29.8072 1.0382 3.8620 1.90 0.4172 2.5438 2.2619 0.7789
1 2012-09-17 07:00:00 22.3978 5.1938 1.7066 3.86 1.4574 1.1275 1.0634 6.9259 2.9948 ... 4.2521 0.96 29.7393 1.0382 3.8529 1.90 0.4229 2.5266 2.2462 0.7789
2 2012-09-17 07:15:00 22.3978 5.2084 1.7102 NaN 1.4610 1.1356 1.0679 6.9909 2.9855 ... 4.2521 0.97 29.6716 1.0463 3.8436 1.91 0.4229 2.5266 2.2566 0.7789
3 2012-09-17 07:30:00 22.3978 5.1938 1.7102 3.86 1.4537 1.1275 1.0679 6.9584 2.9855 ... 4.2521 0.97 29.7393 1.0382 3.8529 1.91 0.4286 2.5324 2.2619 0.7860
4 2012-09-17 07:45:00 22.5649 5.2084 1.7102 3.87 1.4574 1.1356 1.0725 6.9909 2.9760 ... 4.2521 0.97 29.8072 1.0382 3.8620 1.90 0.4286 2.5324 2.2619 0.7789
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
50007 2019-07-23 14:00:00 20.4800 7.7300 9.1400 2.47 3.2300 1.2100 2.8400 20.3000 NaN ... 5.6000 4.34 131.6000 1.0500 4.8600 9.98 5.3500 2.7500 4.2500 NaN
50008 2019-07-23 14:15:00 20.5000 7.7200 9.1400 2.47 3.2200 1.2100 2.8400 20.3200 NaN ... 5.5700 4.35 131.5000 1.0500 4.8600 9.98 5.3400 2.7500 4.2400 NaN
50009 2019-07-23 14:30:00 20.5000 7.7400 9.1300 2.46 3.2300 1.2100 2.8300 20.3400 NaN ... 5.5700 4.36 131.5000 1.0500 4.8600 9.96 5.3400 2.7600 4.2400 NaN
50010 2019-07-23 14:45:00 20.4000 7.7000 9.1400 2.47 3.2400 1.2100 2.8200 20.3800 NaN ... 5.5700 4.35 131.3000 1.0400 4.8600 9.94 5.3400 2.7700 4.2400 NaN
50011 2019-07-23 15:00:00 20.4600 7.7000 9.1400 2.47 3.2300 1.2000 2.8300 20.3200 NaN ... 5.5600 4.34 131.8000 1.0500 4.8500 9.93 5.3300 2.7700 4.2400 NaN

50012 rows × 61 columns

It is known that the dataset also includes NaN observations, for reasons such as the public offering dates of the stocks. The stock-based counts of NaN values are printed below, sorted from highest to lowest; ISDMR has the most NaN observations. Handling missing values is another topic I would have to focus on, but it is out of scope for this homework. So, in the PCA step I simply filled these missing values with the overall mean of each stock, while being aware that this is a crude approach that could lead to biased conclusions.
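As a hedged aside (not what is done in this notebook), a per-stock forward fill is one less distorting alternative: it carries the last observed price forward and leaves the pre-listing gap as NaN instead of overwriting it with the column mean. The frame below is synthetic and its column names are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Tiny synthetic wide frame: "OLD" trades over the whole period,
# "IPO" is listed only from the third timestamp on (hypothetical names)
wide = pd.DataFrame({
    "OLD": [10.0, np.nan, 10.2, 10.3],
    "IPO": [np.nan, np.nan, 5.0, np.nan],
})

filled = wide.ffill()  # gaps after listing are carried forward;
                       # leading NaNs (pre-listing) stay NaN
print(filled["OLD"].tolist())  # -> [10.0, 10.0, 10.2, 10.3]
```

Unlike mean imputation, this never invents a price before a stock existed, though it still flattens genuine intraday gaps.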

In [ ]:
# Number of NaN observations per stock

data_pivot.isna().sum().sort_values(ascending=False)
Out[ ]:
ISDMR        37785
ISFIN         7135
ISYAT         6828
ICBCT         5676
PGSUS         4791
             ...  
ISCTR          791
YKBNK          787
THYAO          730
GARAN          704
timestamp        0
Length: 61, dtype: int64
In [ ]:
# Pairwise correlation matrix to understand the relationships between stocks

stock_names = [x for x in list(data_pivot) if "timestamp" not in x]
corr_matrix = data_pivot[stock_names].corr()
corr_matrix.head()
Out[ ]:
AEFES AKBNK AKSA AKSEN ALARK ALBRK ANACM ARCLK ASELS ASUZU ... TTKOM TUKAS TUPRS USAK VAKBN VESTL YATAS YKBNK YUNSA ZOREN
AEFES 1.000000 0.267502 -0.000625 0.528379 0.319894 0.496981 0.117802 -0.335744 0.032488 0.442354 ... 0.255756 -0.204048 -0.100480 0.083471 0.397543 -0.150159 0.235401 0.538889 0.475578 0.061791
AKBNK 0.267502 1.000000 0.572321 0.548420 0.657648 0.165933 0.439169 0.644439 0.583853 0.587298 ... 0.578765 0.315863 0.384826 0.785438 0.942930 0.484747 0.593489 0.516761 0.083837 0.627474
AKSA -0.000625 0.572321 1.000000 0.280772 0.636207 -0.045111 0.712541 0.735731 0.844647 0.760313 ... 0.131398 0.561769 0.732025 0.728294 0.512615 0.776674 0.821989 -0.169104 0.394182 0.701427
AKSEN 0.528379 0.548420 0.280772 1.000000 0.470681 0.283348 0.469720 0.002287 0.439348 0.594102 ... 0.211780 -0.077261 0.235223 0.581016 0.606355 0.136128 0.456702 0.375938 0.451502 0.371093
ALARK 0.319894 0.657648 0.636207 0.470681 1.000000 0.158375 0.736034 0.464037 0.752573 0.719034 ... 0.080230 0.498251 0.655195 0.652558 0.653074 0.509264 0.791923 0.161212 0.374540 0.391374

5 rows × 60 columns

In [ ]:
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

fig, ax = plt.subplots(figsize=(15, 12))
cmap = sns.color_palette("vlag", as_cmap=True)
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
            square=True, linewidths=.7, cbar_kws={"shrink": .4})
Out[ ]:
<AxesSubplot:>

By investigating the heatmap, I chose three pairs of stocks representing positive, negative, and no correlation. In each category, the randomly selected pair is just a representative of that category. The selection was based on the cell colors alone, not on the exact correlation values:

  • SISE & TKFEN, which are positively correlated with each other,
  • TCELL & GUBRF, which have no notable correlation with each other, and
  • ISDMR & BAGFS, which are negatively correlated with each other.
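Since the selection is color-based, the exact coefficients behind those cells can always be read back with a plain `.loc` lookup on the correlation matrix (in the notebook, e.g. `corr_matrix.loc["SISE", "TKFEN"]`). The sketch below demonstrates the same lookup on a small synthetic frame, as the real matrix values are not reproduced here:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wide price table: two strongly positively
# related columns and two strongly negatively related ones
rng = np.random.default_rng(0)
x = rng.normal(size=200)
prices = pd.DataFrame({
    "POS_A": x, "POS_B": x + rng.normal(scale=0.1, size=200),
    "NEG_A": x, "NEG_B": -x + rng.normal(scale=0.1, size=200),
})

corr = prices.corr()
print(corr.loc["POS_A", "POS_B"] > 0.9)   # strongly positive pair
print(corr.loc["NEG_A", "NEG_B"] < -0.9)  # strongly negative pair
```

Checking the numbers this way guards against misreading a diverging colormap, where weak correlations of either sign look like similar pale cells.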

Task 2 - Moving Window Correlation

Given the interesting pairs of stocks selected in the previous section, I will dive deeper into these pairs in the moving window correlation step.

1-month, 3-month, and 6-month moving windows were selected to analyze the rolling correlations of these pairs of stocks.

In [ ]:
PERIOD_SPACE = ["30D", "90D", "180D"]
PAIR1 = ("SISE", "TKFEN")
PAIR2 = ("TCELL", "GUBRF")
PAIR3 = ("ISDMR", "BAGFS")
In [ ]:
def calculate_rolling_corrs(
        data: pd.DataFrame, stock_pairs: tuple, period_space: list
        ) -> pd.DataFrame:
    """
    Returns a dataframe of rolling correlation values over the desired window sizes.
    """
    stock1, stock2 = stock_pairs

    rolling_ops_df = data.set_index("timestamp")

    rolling_cors = []
    for prd in period_space:
        temp_rolling_cor = (rolling_ops_df[stock1]
                            .rolling(prd)
                            .corr(rolling_ops_df[stock2])
                            .reset_index(name="Rolling Corr Values")
                            )
        temp_rolling_cor["Window Size"] = f"{prd} Rolling Corr"
        rolling_cors.append(temp_rolling_cor)

    rolling_cors_df = pd.concat(rolling_cors, ignore_index=True)
    rolling_cors_df = rolling_cors_df.pivot_table(
        index="timestamp", columns="Window Size", values="Rolling Corr Values", dropna=False
        ).reset_index().rename_axis(None, axis=1)

    return rolling_cors_df
In [ ]:
rolling_corrs_pair1 = calculate_rolling_corrs(
    data_pivot, PAIR1, PERIOD_SPACE 
)
ax = rolling_corrs_pair1.plot(x="timestamp", figsize=(25, 5))
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR1} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
In [ ]:
rolling_corrs_pair2 = calculate_rolling_corrs(
    data_pivot, PAIR2, PERIOD_SPACE 
)
ax = rolling_corrs_pair2.plot(x="timestamp", figsize=(25, 5))
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR2} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
In [ ]:
rolling_corrs_pair3 = calculate_rolling_corrs(
    data_pivot, PAIR3, PERIOD_SPACE 
)
ax = rolling_corrs_pair3.plot(x="timestamp", figsize=(25, 5))
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR3} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()

It is time to evaluate these three pairs of stocks with respect to the plot outputs:

  • (SISE, TKFEN)
  • (TCELL, GUBRF)
  • (ISDMR, BAGFS)

All of the mentioned pairs show common trends across their 1M, 3M, and 6M rolling correlation lines: each sometimes goes up and sometimes goes down, although their characteristics certainly differ from one another. In short, seasonal patterns can be seen in the plots. These may originate from market manipulation, news about the companies' sectors, or bureaucratic developments. These patterns will be validated in the Google Trends analysis.

Task 3 - PCA Application

Dimensionality reduction was applied via Principal Component Analysis (PCA). Before applying PCA, standard scaling of the features is essential for a fair evaluation. However, imputing the NaN observations is the first step of this task, because PCA does not work with NaN values.

In [ ]:
imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data_pivot[stock_names])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(imputed_data)

scaled_data
Out[ ]:
array([[ 0.57857848, -1.34926492, -2.02795754, ..., -0.05370799,
        -1.41440415, -1.52566733],
       [ 0.57857848, -1.3648411 , -2.02930519, ..., -0.09471605,
        -1.42662012, -1.52566733],
       [ 0.57857848, -1.34926492, -2.02795754, ..., -0.09471605,
        -1.41852801, -1.52566733],
       ...,
       [-0.19710197,  1.3516016 ,  0.74963037, ...,  0.46175389,
         0.12473097,  0.        ],
       [-0.23797458,  1.30892714,  0.75337385, ...,  0.48559579,
         0.12473097,  0.        ],
       [-0.21345102,  1.30892714,  0.75337385, ...,  0.48559579,
         0.12473097,  0.        ]])
In [ ]:
pca_model = PCA_Function(random_state=3169)
pca_model.fit(scaled_data)
Out[ ]:
PCA(random_state=3169)
In [ ]:
pca_summary = pd.DataFrame(
    np.vstack(
        [pca_model.get_feature_names_out(), 
         pca_model.explained_variance_ratio_,
         np.cumsum(pca_model.explained_variance_ratio_)]
         )).T
pca_summary.columns = ["PCA Components", "Proportion of Variance", "Cumulative Proportion"]
pca_summary[:20]
Out[ ]:
PCA Components Proportion of Variance Cumulative Proportion
0 pca0 0.488094 0.488094
1 pca1 0.170283 0.658377
2 pca2 0.107268 0.765646
3 pca3 0.049891 0.815537
4 pca4 0.038081 0.853618
5 pca5 0.021406 0.875023
6 pca6 0.015324 0.890347
7 pca7 0.013280 0.903627
8 pca8 0.010235 0.913862
9 pca9 0.008878 0.922740
10 pca10 0.008343 0.931083
11 pca11 0.007107 0.938190
12 pca12 0.006225 0.944415
13 pca13 0.005945 0.950360
14 pca14 0.004796 0.955156
15 pca15 0.004549 0.959705
16 pca16 0.003477 0.963182
17 pca17 0.003088 0.966270
18 pca18 0.002632 0.968903
19 pca19 0.002611 0.971514
In [ ]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(x=pca_summary["PCA Components"], height=pca_summary["Proportion of Variance"])
ax.set_ylabel("Proportion of Variance")
ax.set_xlabel("Components")
ax.set_title("Proportion of Variance per PCA Components")
ax2 = ax.twinx()
ax2.plot(pca_summary["PCA Components"], pca_summary["Cumulative Proportion"], label="Cumulative Variance", color="red", marker="o")
ax2.legend()
fig.autofmt_xdate(rotation=90)
In [ ]:
# Loadings table: rows are stocks (features) and columns are principal components,
# so components_ (shape: components x features) must be transposed
col_names = [f"pca{x}" for x in range(0, 60)]
pca_loadings = pd.DataFrame(pca_model.components_.T, columns=col_names)
pca_loadings.insert(0, "Stocks", list(data_pivot[stock_names]))
pca_loadings.head()
Out[ ]:
Stocks pca0 pca1 pca2 pca3 pca4 pca5 pca6 pca7 pca8 ... pca50 pca51 pca52 pca53 pca54 pca55 pca56 pca57 pca58 pca59
0 AEFES -0.010619 -0.115330 -0.162991 -0.074086 -0.139545 -0.008576 -0.168831 -0.124655 -0.176649 ... 0.011291 -0.115707 -0.169414 -0.130355 -0.097293 -0.157651 -0.168918 0.037678 -0.087511 -0.131610
1 AKBNK -0.164429 -0.212004 -0.033005 -0.159788 -0.081720 -0.112583 0.058363 -0.032705 0.003778 ... -0.268609 0.065278 0.096402 -0.142497 -0.249343 0.035116 -0.017233 -0.279360 -0.000860 -0.097437
2 AKSA -0.261382 0.073484 0.068234 -0.147875 -0.106180 -0.203374 -0.094810 0.251670 0.005959 ... 0.110387 0.057695 -0.011123 0.079051 0.022534 0.082714 -0.071226 -0.055424 -0.267015 0.069573
3 AKSEN 0.014416 -0.094138 0.076323 -0.196707 -0.158188 0.248636 -0.086114 -0.038360 -0.090251 ... 0.114039 0.104382 -0.033534 -0.177370 -0.075411 0.135683 -0.045575 0.010490 0.188490 0.207352
4 ALARK 0.015394 -0.105984 0.066446 0.236019 -0.223156 -0.122610 -0.011328 -0.123666 0.063168 ... -0.066522 -0.320755 -0.067071 0.091815 -0.102075 -0.027652 0.004609 -0.123473 0.161817 0.143059

5 rows × 61 columns

In [ ]:
# Factor loadings heatmap

pca_loadings_matrix = pca_loadings.set_index("Stocks")

fig, ax = plt.subplots(figsize=(15, 12))
cmap = sns.color_palette("vlag", as_cmap=True)
sns.heatmap(pca_loadings_matrix, cmap=cmap,
            vmax=pca_loadings_matrix.max().max(),
            vmin=pca_loadings_matrix.min().min(), center=0,
            square=True, linewidths=.7, cbar_kws={"shrink": .4})
Out[ ]:
<AxesSubplot:ylabel='Stocks'>

To briefly discuss the PCA outputs: the proportion-of-variance (explained variance) plot provides an intuition about how much of the data's variability each component represents. The first 9 PCA components represent nearly 90% of the variance in the dataset. The factor loadings heatmap, in turn, gives information about each stock's weight per component. For example, in component 0, AYGAZ, BAGFS, BRISA, CEMAS, and ISYAT have higher loadings than the other stocks, meaning these stocks contribute more to that component. With domain knowledge, such as the sectors of these companies or strategic partnerships among them, the component-based loadings could be interpreted more meaningfully; components could capture exactly this kind of specific information.
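The "first k components cover ~90%" reading can also be automated: scikit-learn's `PCA` accepts a float `n_components` between 0 and 1 and keeps just enough components to reach that share of variance. A self-contained sketch on synthetic data (standing in for `scaled_data`):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for scaled_data: 12 features driven by 3 latent factors
rng = np.random.default_rng(3169)
base = rng.normal(size=(500, 3))
X = base @ rng.normal(size=(3, 12)) + 0.01 * rng.normal(size=(500, 12))

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that share (here 90%)
pca = PCA(n_components=0.90)
pca.fit(X)
print(pca.n_components_)  # number of components actually kept
```

On the real data this would select the component count directly from the cumulative-proportion column, instead of reading it off the table by eye.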

Task 4 - Google Trends Analysis

Google Trends insights were used to validate the personal observations from the analyses applied above.

InΒ [Β ]:
def get_gtrend_rolling_correlation(pair):
    """Processes the raw Google Trends data and calculates the 6-month rolling correlation."""
    stock1, stock2 = pair
    filename = "_".join(list(pair))
    data = pd.read_csv(DATA_LOCAL_PATH + f"multiTimeline_(unknown).csv").reset_index().iloc[1:, :]
    data.columns = ["date", f"{stock1}_Popularity", f"{stock2}_Popularity"]
    data["date"] = pd.to_datetime(data.date)
    data.set_index("date", inplace=True)
    # The Google Trends data is monthly, so a 6-row window corresponds to 6 months
    gtrend_rolling_corr = data[f"{stock1}_Popularity"].rolling(6).corr(data[f"{stock2}_Popularity"]).reset_index(name="gtrend_val")

    return gtrend_rolling_corr
In [ ]:
gtrend_rolling_corr = get_gtrend_rolling_correlation(PAIR1)

rolling_corrs_pair1 = calculate_rolling_corrs(
    data_pivot, PAIR1, PERIOD_SPACE 
)
ax = rolling_corrs_pair1.plot(x="timestamp", figsize=(25, 5))
ax.plot(gtrend_rolling_corr.date, gtrend_rolling_corr.gtrend_val, color="black", label="GTrend 6-Month Rolling Correlation")
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR1} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
In [ ]:
gtrend_rolling_corr = get_gtrend_rolling_correlation(PAIR2)

rolling_corrs_pair2 = calculate_rolling_corrs(
    data_pivot, PAIR2, PERIOD_SPACE 
)
ax = rolling_corrs_pair2.plot(x="timestamp", figsize=(25, 5))
ax.plot(gtrend_rolling_corr.date, gtrend_rolling_corr.gtrend_val, color="black", label="GTrend 6-Month Rolling Correlation")
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR2} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()
In [ ]:
gtrend_rolling_corr = get_gtrend_rolling_correlation(PAIR3)

rolling_corrs_pair3 = calculate_rolling_corrs(
    data_pivot, PAIR3, PERIOD_SPACE 
)
ax = rolling_corrs_pair3.plot(x="timestamp", figsize=(25, 5))
ax.plot(gtrend_rolling_corr.date, gtrend_rolling_corr.gtrend_val, color="black", label="GTrend 6-Month Rolling Correlation")
ax.axhline(0, label="Ref Line", ls="--", alpha=0.69)
ax.set_title(f"{PAIR3} & 1M, 3M and 6M Rolling Correlation Values over Time")
ax.legend()
plt.show()

Although the trends of the stock pairs' rolling correlations are generally similar, their values differ most of the time. The reason may be that search volume and financial market dynamics are driven by completely different mechanisms.

Conclusion

In this assignment, we attempted to thoroughly analyze 60 Borsa Istanbul stocks and their closing prices from roughly 2012 to 2019. We mainly focused on descriptive analysis, moving window correlation analysis on selected pairs of stocks, and then dimensionality reduction (PCA). Finally, using the information about the relationships between the selected pairs that we gathered from Google Trends, we attempted to draw a conclusion.